
Enable split mode graph for on-the-fly merged up/gate experts#1413

Merged

ikawrakow merged 3 commits into main from ik/sm_graph_muge on Mar 13, 2026

Conversation

@ikawrakow (Owner) commented Mar 12, 2026

While at it: this PR is a follow-up to #1412. It enables using on-the-fly merged ffn_up/gate_exps tensors (the -muge command line option) with split mode graph.

On a 2x3090 system, I see ~10% better PP for the few models I tested.

As a reminder: add -sm graph -muge to the command line to get the benefit of this PR.

Here is a sweep-bench for GPT-OSS-20B-MXFP4 on the 2x3090 system. The llama.cpp results are with build 8314.

[sweep-bench plot: gpt_oss_pp]

Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 12, 2026
…ow#1413

Split mode graph for on-the-fly merged ffn_up/gate_exps

Cleanup

Also handle merged bias

@ubergarm (Contributor)

This quick test shows -muge giving a +6.8% boost at short kv-cache depth and +2.8% near 128k depth with my Qwen3.5-122B-A10B-IQ4_KSS. It's been running well in some light testing today, including mmproj. 🚀

[sweep-bench plot: sweep-bench-Qwen3.5-122B-A10B-IQ4_KSS-PR1413]

Details:

title: "ik_llama.cpp PR1413 ik/sm_graph_muge@c046f7f3"
subtitle: "ubergarm/Qwen3.5-122B-A10B IQ4_KSS 61.219 GiB (4.306 BPW)"
hardware: "2x RTX A6000 (48GB VRAM each) Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!\n"

-sm graph

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
4096 128 0 1.348 3039.33 1.650 77.58
4096 128 4096 1.400 2925.21 1.660 77.10
4096 128 8192 1.450 2825.11 1.677 76.31
4096 128 12288 1.504 2723.43 1.710 74.86
4096 128 16384 1.570 2609.23 1.716 74.59
4096 128 20480 1.629 2514.64 1.725 74.21
4096 128 24576 1.688 2426.82 1.748 73.22
4096 128 28672 1.747 2345.06 1.754 72.96
4096 128 32768 1.797 2278.90 1.780 71.91
4096 128 36864 1.859 2203.43 1.786 71.69
4096 128 40960 1.914 2139.97 1.792 71.42
4096 128 45056 1.967 2082.38 1.817 70.43
4096 128 49152 2.023 2024.23 1.824 70.19
4096 128 53248 2.069 1979.35 1.833 69.82
4096 128 57344 2.113 1938.40 1.852 69.10
4096 128 61440 2.169 1888.08 1.859 68.87
4096 128 65536 2.218 1846.38 1.881 68.05
4096 128 69632 2.272 1802.70 1.889 67.75
4096 128 73728 2.323 1763.21 1.893 67.63
4096 128 77824 2.382 1719.76 1.921 66.64
4096 128 81920 2.433 1683.82 1.927 66.42
4096 128 86016 2.478 1652.73 1.943 65.89
4096 128 90112 2.530 1619.04 1.958 65.37
4096 128 94208 2.581 1586.89 1.966 65.12
4096 128 98304 2.637 1553.40 1.988 64.40
4096 128 102400 2.695 1520.12 1.994 64.19
4096 128 106496 2.741 1494.57 2.003 63.90
4096 128 110592 2.794 1466.23 2.024 63.26
4096 128 114688 2.847 1438.82 2.028 63.13
4096 128 118784 2.899 1413.02 2.054 62.32
4096 128 122880 2.954 1386.81 2.059 62.17
4096 128 126976 3.002 1364.55 2.063 62.03
4096 128 131072 3.056 1340.13 2.087 61.34

-sm graph -muge

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -muge \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
4096 128 0 1.262 3246.57 1.641 77.98
4096 128 4096 1.313 3120.22 1.646 77.79
4096 128 8192 1.355 3022.54 1.661 77.05
4096 128 12288 1.409 2906.77 1.692 75.66
4096 128 16384 1.470 2785.99 1.700 75.28
4096 128 20480 1.526 2684.02 1.706 75.03
4096 128 24576 1.586 2582.02 1.729 74.02
4096 128 28672 1.643 2493.31 1.738 73.63
4096 128 32768 1.699 2410.52 1.766 72.48
4096 128 36864 1.753 2336.64 1.769 72.36
4096 128 40960 1.807 2266.81 1.775 72.11
4096 128 45056 1.853 2210.21 1.798 71.21
4096 128 49152 1.913 2140.88 1.803 71.01
4096 128 53248 1.960 2089.72 1.814 70.57
4096 128 57344 2.013 2034.95 1.835 69.74
4096 128 61440 2.070 1979.21 1.844 69.43
4096 128 65536 2.115 1936.36 1.867 68.57
4096 128 69632 2.174 1884.38 1.874 68.31
4096 128 73728 2.228 1838.02 1.881 68.05
4096 128 77824 2.280 1796.17 1.906 67.14
4096 128 81920 2.331 1756.85 1.908 67.09
4096 128 86016 2.384 1718.11 1.921 66.63
4096 128 90112 2.435 1681.97 1.938 66.05
4096 128 94208 2.489 1645.42 1.944 65.86
4096 128 98304 2.549 1606.77 1.967 65.06
4096 128 102400 2.596 1577.51 1.976 64.77
4096 128 106496 2.656 1541.96 1.983 64.54
4096 128 110592 2.708 1512.47 2.008 63.75
4096 128 114688 2.760 1484.08 2.013 63.59
4096 128 118784 2.812 1456.69 2.038 62.81
4096 128 122880 2.874 1425.21 2.048 62.50
4096 128 126976 2.922 1401.85 2.051 62.42
4096 128 131072 2.972 1377.98 2.076 61.65
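As a sanity check on the headline percentages, they can be recomputed from the S_PP column at depth 0 and depth 131072 in the two tables above:

```python
# S_PP (t/s) at two kv-cache depths, taken from the two sweep-bench tables.
base   = {0: 3039.33, 131072: 1340.13}   # -sm graph
merged = {0: 3246.57, 131072: 1377.98}   # -sm graph -muge

for depth in (0, 131072):
    gain = (merged[depth] / base[depth] - 1) * 100
    print(f"depth {depth:>6}: +{gain:.1f}% PP")
# depth      0: +6.8% PP
# depth 131072: +2.8% PP
```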

I don't have a mainline-compatible quant handy to test on this rig, but I have seen that issue where sweep-bench fails towards the end on mainline too, as you mentioned in another PR.

@hksdpc255 (Contributor)

Are on-the-fly merged ffn_up/gate_exps GGUFs faster than just using --merge-up-gate-experts on non-merged GGUFs?

@ikawrakow (Owner, Author)

Are on-the-fly merged ffn_up/gate_exps GGUFs faster than just using --merge-up-gate-experts on non-merged GGUFs?

It is the same thing. I started using "on-the-fly merged" for --merge-up-gate-experts to distinguish it from the case where these tensors have already been merged in the model stored on disk (see issue #1399, which you opened yourself).

@hksdpc255 (Contributor)

I'm confused. Does that mean using pre-merged GGUFs is the same as using non-merged GGUFs with the -muge option?

@ikawrakow (Owner, Author)

I'm confused. Does that mean using pre-merged GGUFs is the same as using non-merged GGUFs with the -muge option?

Yes, it is the same. In the pre-merged case, someone (for instance AesSedai) has prepared the model such that the ffn_up_exps and ffn_gate_exps tensors are merged into a single ffn_gate_up_exps tensor and stored the model that way on disk. In that case, when we load the model we don't need to do anything to take advantage of the merge. With the on-the-fly merge (-muge) the model stored on disk contains separate ffn_up_exps and ffn_gate_exps tensors, and we merge them on-the-fly while loading the model. The end result (i.e., what happens during inference) is exactly the same.
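The equivalence is easy to see with a toy example (plain Python, tiny dimensions, not the project's actual CUDA/C++ code): stacking the up and gate weight rows into one fused matrix and doing a single matmul produces exactly the two outputs the separate matmuls would. The names mirror the GGUF tensor names; the gate-first ordering inside the fused tensor is an assumption made here for illustration.

```python
def matmul(w, x):
    # Multiply a weight matrix (list of rows) by an input vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w_up   = [[1.0, 2.0], [3.0, 4.0]]    # ffn_up_exps (one expert, 2x2)
w_gate = [[5.0, 6.0], [7.0, 8.0]]    # ffn_gate_exps
x = [1.0, 0.5]

# Non-merged path: two separate matmuls.
up, gate = matmul(w_up, x), matmul(w_gate, x)

# Merged path (pre-merged on disk, or fused at load time by -muge):
# one ffn_gate_up_exps tensor, one larger matmul, outputs split in half.
w_gate_up = w_gate + w_up
fused = matmul(w_gate_up, x)
gate2, up2 = fused[:2], fused[2:]

assert gate2 == gate and up2 == up
print(gate, up)   # → [8.0, 11.0] [2.0, 5.0]
```

The win comes from launching one large kernel per layer instead of two smaller ones, which matters most for prompt processing, where the matmuls are batch-heavy.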

@hksdpc255 (Contributor)

CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.465 4400.22 3.652 140.18
2048 512 2048 0.415 4929.12 3.700 138.36
2048 512 4096 0.413 4962.78 3.740 136.88
2048 512 6144 0.419 4892.39 3.785 135.25
2048 512 8192 0.425 4820.18 3.827 133.79
2048 512 10240 0.430 4765.75 3.893 131.52
2048 512 12288 0.438 4678.00 3.939 130.00
2048 512 14336 0.445 4607.09 3.992 128.27
2048 512 16384 0.450 4555.36 4.042 126.68
2048 512 18432 0.457 4485.72 4.096 124.99
2048 512 20480 0.464 4418.29 4.153 123.28
2048 512 22528 0.469 4365.78 4.237 120.85
2048 512 24576 0.476 4299.84 4.226 121.14
2048 512 26624 0.482 4251.36 4.263 120.09
2048 512 28672 0.488 4200.72 4.290 119.35
2048 512 30720 0.493 4151.75 4.323 118.43
2048 512 32768 0.505 4055.22 4.361 117.39
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.440 4654.74 3.676 139.28
2048 512 2048 0.393 5216.47 3.725 137.45
2048 512 4096 0.388 5274.11 3.760 136.19
2048 512 6144 0.396 5171.61 3.805 134.57
2048 512 8192 0.403 5081.13 3.850 132.97
2048 512 10240 0.408 5015.72 3.901 131.26
2048 512 12288 0.414 4941.12 3.942 129.87
2048 512 14336 0.421 4862.68 4.003 127.91
2048 512 16384 0.428 4787.63 4.057 126.20
2048 512 18432 0.435 4708.26 4.112 124.52
2048 512 20480 0.440 4649.41 4.183 122.41
2048 512 22528 0.447 4582.38 4.271 119.87
2048 512 24576 0.453 4522.47 4.298 119.13
2048 512 26624 0.459 4460.42 4.309 118.81
2048 512 28672 0.465 4407.75 4.338 118.04
2048 512 30720 0.470 4361.50 4.371 117.13
2048 512 32768 0.476 4298.20 4.402 116.30
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.454 4507.90 3.697 138.47
2048 512 2048 0.399 5137.54 3.745 136.72
2048 512 4096 0.389 5261.75 3.782 135.37
2048 512 6144 0.395 5185.13 3.814 134.24
2048 512 8192 0.403 5086.56 3.862 132.57
2048 512 10240 0.408 5017.28 3.915 130.79
2048 512 12288 0.414 4941.87 3.952 129.55
2048 512 14336 0.422 4855.75 4.006 127.80
2048 512 16384 0.427 4795.48 4.058 126.16
2048 512 18432 0.433 4731.85 4.091 125.16
2048 512 20480 0.437 4686.27 4.154 123.26
2048 512 22528 0.444 4615.27 4.234 120.92
2048 512 24576 0.450 4549.59 4.267 120.00
2048 512 26624 0.456 4487.65 4.291 119.31
2048 512 28672 0.463 4420.85 4.315 118.65
2048 512 30720 0.469 4370.51 4.342 117.92
2048 512 32768 0.475 4307.13 4.379 116.91
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.457 4481.63 3.707 138.11
2048 512 2048 0.399 5128.53 3.741 136.87
2048 512 4096 0.388 5284.00 3.775 135.64
2048 512 6144 0.394 5202.74 3.790 135.10
2048 512 8192 0.401 5112.52 3.829 133.73
2048 512 10240 0.406 5048.00 3.880 131.95
2048 512 12288 0.412 4971.28 3.926 130.43
2048 512 14336 0.419 4892.28 3.974 128.82
2048 512 16384 0.424 4831.77 4.030 127.05
2048 512 18432 0.431 4748.12 4.085 125.34
2048 512 20480 0.437 4686.52 4.151 123.33
2048 512 22528 0.444 4610.63 4.227 121.12
2048 512 24576 0.450 4547.48 4.280 119.63
2048 512 26624 0.457 4481.97 4.305 118.93
2048 512 28672 0.463 4420.30 4.338 118.01
2048 512 30720 0.468 4372.18 4.378 116.96
2048 512 32768 0.477 4295.49 4.409 116.14

@ikawrakow ikawrakow merged commit 7fab617 into main Mar 13, 2026
@abc-nix (Contributor) commented Mar 13, 2026

In my experience, running -sm graph -muge for hybrid GPU+CPU inference (with tensor offloading) of Qwen 3.5 397B IQ4_KSS from ubergarm causes the output to fall into loops. Flags used:

      -c 210000 \
      --jinja \
      -fa 1 -ngl 99 -ub 4096 -b 8192 \
      --ctx-checkpoints 12 --ctx-checkpoints-interval 16383 \
      -cuda fusion=1,offload-batch-size=4,mmq-id-size=128 \
      -gr -ger \
      --split-mode graph -smgs --graph-reduce-type f32 \
      -muge \
      -ts 35,26 \
      -ot "blk\.([0-2])\.ffn_(up|gate|down)_exps\.weight=CUDA0" \
      -ot "blk\.([5][7-9])\.ffn_(up|gate|down)_exps\.weight=CUDA1" \
      -ot "blk\.([0-9]|[1-9][0-9])\.ffn_(up|gate|down)_exps\.weight=CPU" \
      --no-warmup --no-mmap

If I don't use -muge I get proper outputs.

I suppose it is due to the "unexpected results if using custom tensor offloads with split-mode graph" warning. I am thankful that it still works very well without merging the up and gate expert tensors, so this is just a drawback of using custom tensor offloading.

@ubergarm (Contributor) commented Mar 13, 2026

@abc-nix

I'm not 100% sure, but digging through some recent PRs on imatrix fused up|gate tensors and the original -muge PR, it may be that when you use -muge your -ot patterns need to change from ffn_(up|gate|down)_exps to ffn_(gate_up|down)_exps ... might give that a try?

Same thing applies if you're using a mainline pre-merged quant...

ik was gracious and renamed the existing convention here to reduce confusion with the new, opposite naming convention on mainline...

so it is ffn_gate_up_exps everywhere now, pretty sure; gonna go submit a PR to add this to the --cpu-moe and --n-cpu-moe regex strings now
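The regex concern can be checked quickly with Python's re module (a standalone sanity check, not the project's actual -ot matching code — the exact override semantics in ik_llama.cpp are assumed here to be plain regex matching on tensor names):

```python
import re

# One layer's expert tensor names; ffn_gate_up_exps is the name used once
# up and gate are merged (pre-merged on disk, or on the fly via -muge).
names = [
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_up_exps.weight",
]

separate = re.compile(r"blk\.0\.ffn_(up|gate|down)_exps\.weight")
catchall = re.compile(r"blk\.0\.ffn_(gate_up|up|gate|down)_exps\.weight")

# The separate-tensor pattern silently misses the merged tensor:
assert not separate.fullmatch("blk.0.ffn_gate_up_exps.weight")
# The catch-all pattern matches every variant, merged or not:
assert all(catchall.fullmatch(n) for n in names)
```

This is why a catch-all like abc-nix's ffn_(gate_up|up|gate|down)_exps is the safe choice: it covers both merged and non-merged models with one override string.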

@abc-nix (Contributor) commented Mar 13, 2026

@ubergarm, thanks for the tips. This will help when using models with these experts merged (I will be using a catch-all regex ffn_(gate_up|up|gate|down)_exps).

What I am seeing when offloading the up and gate tensors with a non-merged GGUF (the one in your repo) is that they are merged after they are loaded to the device. If I use -ot "blk\.([0-2])\.ffn_(gate_up|up|gate|down)_exps\.weight=CUDA0", I can see:

Tensor blk.0.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 0
Tensor blk.0.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 1
Tensor blk.1.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 2
Tensor blk.2.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
[...]

So that is not the issue. The output still repeats itself, so using -muge with -sm graph and partial expert offloading is broken on my machine (unless there is a conflict with a different flag).
